A fundamental public policy challenge in the Organization for Economic Co-operation and Development (OECD) member countries has long been the increasing imbalance between the growing cohorts of older adults not working and the shrinking cohorts of adults in the age range of labour force participation (age range 20-64) (OECD, 2015). As the baby boom generation (born 1946 to 1955) grows older, social scientists and policy makers have taken an intense interest in how their aging and eventual retirement from the full-time labour force will affect society. In only 15 years, the share of the population aged 65 and over in the OECD countries has increased by more than 3 percentage points, from 13 per cent in 2000 to more than 16 per cent in 2015 (OECD.Stat, data extracted August 10, 2016). The effect of an aging population on a country's societal support burden is often measured by the older dependency ratio, which is the ratio of the older population to the working-age population. The OECD average older dependency ratio (ratio of individuals aged 65 and above to those aged 20-64) has increased considerably over the last half century, from 17.9 in 1970 to 27.5 in 2015 (OECD.Stat, data extracted August 10, 2016). The problem is more pronounced in Europe than in the US; in 2015 the older dependency ratio was 25.0 in the US and as high as 31.3 in Europe (OECD.Stat, data extracted August 22, 2016).
In addition to the effect of the large baby boom generation growing older, the average expected number of years in retirement has increased. In 1970, men in the OECD countries spent on average 11 years in retirement; by 2014, this average had increased to almost 18 years (OECD, 2014). For women, the increase has been from 15 years in 1970 to 22.3 years in 2014. The increase in the average number of years in retirement is partly due to increased longevity and partly due to earlier retirement.
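As a small worked illustration of the older dependency ratio described above, the sketch below assumes that the reported figures (for example 27.5) are expressed as the number of people aged 65 and over per 100 people aged 20-64. The function name and the population counts are hypothetical and chosen only to illustrate the arithmetic; they are not OECD data.

```python
def older_dependency_ratio(aged_65_plus: float, aged_20_64: float) -> float:
    """Return the number of people aged 65+ per 100 people aged 20-64.

    Assumption: the ratio is scaled per 100 persons of working age.
    """
    if aged_20_64 <= 0:
        raise ValueError("working-age population must be positive")
    return 100.0 * aged_65_plus / aged_20_64


# Hypothetical population counts (in millions), for illustration only.
ratio = older_dependency_ratio(aged_65_plus=55.0, aged_20_64=200.0)
print(round(ratio, 1))
```

Under this scaling, a ratio of 27.5 means that there are roughly 3.6 people of working age for every person aged 65 or over.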
Although the effective age of retirement (the average effective age at which workers withdraw from the labour force) decreased between 1970 and 2001, it slowly started to increase in 2004 (OECD, 2014). In 2014 the effective retirement age was on average 64.6 for men (63.2 for women) in the OECD countries; a bit higher in the US (65.9 for men and 63.2 for women) than in Europe (62.9 for men and 61.7 for women) (OECD.Stat, data extracted August 22, 2016). Life expectancy at the effective retirement age has also increased substantially during this period. Recently, the increase in longevity has been fairly equal to that of the effective exit age from the labour market, and potential years in retirement have stabilised (OECD, 2014).
By 2050, the population aged 65 and over in the US is expected to grow to almost 21 per cent and the older dependency ratio is estimated to increase to 38 (OECD.Stat, data extracted August 22). In Europe, the percentage of the population aged 65 and over is expected to grow to almost 29 per cent by 2050, and the older dependency ratio is estimated to increase to 55 (OECD.Stat, data extracted August 22).
At a societal level, this growing imbalance raises serious concerns about the viability and funding of social security, pensions, and health programmes. At an individual level, the concern is probably more that of aging well with the prospect of many years in retirement. The concept of aging well clearly implies maintaining health and effective functioning. Research suggests, however, that retiring for some carries the risk of a fast decline in health (Dave, Rashad & Spasojevic, 2008; Szinovacz & Davey, 2004). The reason may be that retiring deprives people of the deep-seated needs they have for time structure, social contact, collective effort or purpose, social identity or status, and regular activity, which paid work generally provides (Jahoda, 1981; Jahoda, 1982; Jahoda, 1984).
The absence of these latent supportive features of employment may be detrimental to the health of retired workers. In fact, according to Dave, Rashad and Spasojevic (2008), complete retirement leads to rapid declines in mental health, increases in illness, and increases in difficulty performing daily activities. Complete retirement at an early age may thus threaten both the ability of individuals to age well and the ability of societies as a whole to age well, because of the societal burden resulting from health and functional limitations and associated costs.
Several studies have demonstrated that subjective usefulness is strongly related to both physical and psychological health (Ranzijn, Keeves, Luszcz, & Feather, 1998; Ryan & Frederick, 1997; Ryff, 1989). Performing activities other than paid work that are meaningful to the individual may thus help maintain health and functional ability for older people.
Using US data from 1995 and 2005, Einolf (2009) predicted that the baby boom generation's rate of volunteering at the age of retirement (in 2015) would be higher than earlier generations' rate of volunteering. Taking into account the large size of the baby boom cohort as well, Einolf concluded that the total number of older volunteers would increase. The prediction seems to hold true, at least for those countries where it has been possible to locate relevant numbers for the baby boom generation's rate of volunteering. In Canada, the rate of volunteering for those aged 65 and over increased from 32 per cent in 2004 to 36 per cent in 2010 (Vézina & Crompton, 2012), and in Denmark the rate increased from 23 per cent in 2004 to 34 per cent in 2012 (Fridberg & Henriksen, 2014).
In the US, programmes have been initiated to integrate the aging population into voluntary work. Some programmes are organised in local non-profit organizations, referred to as Senior Corps Programs.
“Senior Corps” is a network of national service programmes that provides the opportunity for people aged 55 years or above to apply their life experience to meeting community needs (see www.seniorcorps.org/rsvp/senior-corps-programs-2/). Specific programmes utilized by Senior Corps include the Foster Grandparents Program, the Retired and Senior Volunteer Program (RSVP), and the Senior Companion Program. The idea of engaging older people is not entirely new, however; the Foster Grandparents Program began as a pilot programme in 1965, the Senior Companion Program began in 1968, and RSVP was created as a nationwide program in 1969. A more recent initiative is the Experience Corps, which began in early 1996. One of the mission statements of this programme is to “provide significant benefits for the older Americans who participate” (Grimm, Spring & Dietz, 2007, p. 26).
Volunteering is a complex phenomenon and spans a wide variety of types of activities, organizations and sectors. The intervention of interest in this review is formal volunteering. Formal volunteering can be described as voluntary, on-going, planned, helping behaviour that intends to increase the well-being of strangers, offers no monetary compensation, and typically occurs within an organizational context (Clary et al., 1998; Penner, 2002). We will define formal volunteering centred on four axes (as defined in Hustinx, Cnaan & Handy, 2010). These are:
Examples of interventions eligible for inclusion are Senior Corps Programs such as the Foster Grandparents Program, the Retired and Senior Volunteer Program (RSVP), and the Senior Companion Program.
Volunteering can play a significant role in people's lives as they move from work to retirement. According to Smith and Gay (2005), retirement is the trigger for volunteering for some older people, as it offers a ‘structured’ means of making a meaningful contribution to society once the opportunity to do so through work has been cut off.
Some older people consider voluntary work as a way to replicate aspects of paid work lost upon retirement, such as organisational structure and time discipline (Smith & Gay, 2005). The same line of argument for volunteering can be found in several other studies (see Chappell & Prince, 1997; Fischer, Mueller & Cooper, 1991; Greenfield & Marks, 2004; Newman, Vasudev and Onawola, 1985; Widjaja, 2010). Volunteering thus seems to provide a way of compensating for the losses due to retirement as identified by Jahoda (1981, 1982 and 1984), such as the need for time structure, social contact, collective effort or purpose, social identity or status, and regular activity. Several studies indeed argue that there is a potential health benefit to older volunteers and in particular retirees (Moen and Fields, 2002; Musick and Wilson, 2003; Young and Glasgow, 1998).
The exact mechanisms and processes linking volunteering and health for older people have, however, not been sufficiently explored and may be very complex (Warburton, 2006). Using an in-depth qualitative approach, Warburton (2006) explores this relationship. The data were analysed with a focus on health, although the respondents were not asked directly about the health impact of volunteering. The study identifies six potential themes and discusses their impacts on health. The six themes and their impacts on health are illustrated in Figure 1.
Mechanisms by which volunteering may affect health
Volunteering among older adults seems to be on the increase, and programmes designed specifically for this subpopulation are emerging. Volunteering may contribute both to individuals aging well and to society aging well, since volunteering by older adults also relieves the societal burden if it helps maintain health and functionality for those who volunteer. It thus remains to be established to what extent volunteering impacts on the physical and mental health of those who volunteer.
Health status is often found to be an important predictor of volunteering among those aged 65 years or more; see for example Brown (2000) and Young and Glasgow (1998). The important question to answer is: Does good health predict volunteering, or does volunteering improve health (or maybe both)? Studies that simply assess the association between voluntary work and health outcomes cannot answer this question. Research using appropriate controls and outcome measures can, however, provide some relevant evidence on whether engaging in voluntary work might cause good health outcomes in older people. It is vital that an appropriate comparison group is used to establish the direction of causation. Does volunteering make people healthier, or are healthier people more likely to volunteer? Likewise, it is vital that the health measures are objective. As stated in Wilson and Musick (1999, p. 153): “[C]ross-sectional designs that use participants to self-assess the impact of a volunteer program function as little more than market research for the agency concerned. Without a pre/post-test design and a control group, and without more objective and generalizable outcome measures, little can be learned of the benefits of volunteering from these studies”. The same worries concerning reliance on cross-sectional designs and self-assessment of health to establish causality can be found in Lum and Lightfoot (2005). Hence, considering that the population under investigation in this review by nature volunteers into the intervention, we believe it is vital that an appropriate comparison group, access to relevant pre health measures, and objective health measures are used to establish causality. We are very clear that firm causal conclusions probably cannot be drawn from the studies we expect to include in the review, as we do not expect to find any studies based on randomised trials.
However, a distinction can be drawn between studies that simply assess the association between voluntary work and health outcomes, and studies that control for important confounding factors, in particular pre health measures, and use objective health measures. Studies that control for important confounding factors and use objective health measures provide some evidence for considering possible causal effects. While conclusions about causal effects must be very tentative, it is important to extract and summarize the best evidence available. An obvious question arises: is there any value in conducting a systematic review when it is likely that there are no trial based studies available? We think it is worthwhile as a systematic review may uncover high quality studies that may not be found using less thorough searching methods. Furthermore, if a systematic review demonstrates that high quality studies are lacking, this could encourage a new generation of primary research. Therefore, even though we expect not to find any trial based studies and only a few studies of voluntary work based on appropriate outcome measures and control group comparison, we still believe there is value in conducting the proposed review. The main objective of this review is to answer the following research question: What are the effects of volunteering on the physical and mental health of people aged 65 years or older? It is hard to imagine that a researcher would randomise the allocation of people to volunteer work. We therefore anticipate that relatively few randomised controlled trials on the effects of volunteer work on the health of the volunteers will be found. However, in the unlikely event that a randomised controlled trial is found, it will of course be included in the review. In order to summarise what is known about the possible causal effects of volunteering, we will include all study designs that use a well-defined control group. 
Non-randomised studies, where voluntary work has occurred in the course of usual decisions outside the researcher's control, must demonstrate pre-treatment group equivalence via matching, statistical controls, or evidence of equivalence on key risk variables and participant characteristics. These factors are outlined in the section Assessment of risk of bias in included studies, and the methodological appropriateness of the included studies will be assessed according to the risk of bias model. The study designs we will include in the review are:
The “intervention population” is people aged 65 or over who are engaged in formal voluntary work. Studies where the majority of participants are aged 65 or over, or where results are shown for subgroups of participants aged 65 or over, will be included. We will include voluntary workers of both genders and all nationalities who perform all types of formal voluntary work as defined in the Intervention section.
The intervention of interest in this review is formal volunteering. Formal volunteering can be described as voluntary, on-going, planned, helping behaviour that intends to increase the well-being of strangers, offers no monetary compensation, and typically occurs within an organizational context (see the Background section). Informal ways of helping friends, neighbours, or relatives, such as running errands or providing transportation, which are typically motivated by an obligation to help intimate others, will be excluded. The comparison population is people who are not engaged in formal voluntary work.
The primary focus is on measures of health. As primary outcomes we will include physical health outcomes as well as mental health outcomes. All measures of physical health outcomes reported in studies using a comparable control group have to be objective in order to be included as primary outcomes.
As mentioned above, Wilson and Musick (1999) highlight the problem with studies relying on self-assessment of the impact of volunteering. Self-assessment of health should not be confused with self-reported measures. By self-assessment we understand questions of the form: “Would you say the state of your health is excellent, good, fair, poor or very poor?” which will not be included as a primary outcome. On the other hand, we do not expect measures of mental health outcomes to be obtained via structured clinical interviews. Instead, we expect self-reported questionnaires to be used to screen for probable mental disorders. The use of different instruments of detection may be an important source of variation for the incidence of measured mental health outcomes. Measures of health have to be standardized to be included; see below. General scales of well-being will be included if they are measured by standardized psychological symptom measures.
Examples of physical health primary outcomes include mortality, time until the onset of a serious disease (for example a heart attack, stroke, cancer, or arthritis), and functional disability (measured by a standardized physical ability measure such as a difficulties in activities of daily living score (ADLs, see Katz et al., 1963) or a difficulties in instrumental activities of daily living score (IADLs, see Lawton and Brody, 1969)). Examples of mental health primary outcomes include depression, anxiety and mental health-related disability measured by standardized psychological symptom measures such as the Center for Epidemiological Studies Depression Scale (CES-D), the Hopkins Symptom Checklist and the Medical Outcomes Study – Short Form.
Although some researchers express concerns about using self-assessed health measures (Wilson & Musick, 1999; Lum & Lightfoot, 2005), others argue that self-assessed health can be a good predictor of mortality (Jylhä, 2009; Jylhä, Volpato & Guralnik, 2006).
If studies report self-assessment of health, using questions of the form: “Would you say the state of your health is excellent, good, fair, poor or very poor?” they will be included as secondary outcomes. Time points for measures considered will be:
The volunteer work may be done in all organisational contexts such as religious organisations, educational organisations, health organisations, political groups, sports clubs, cultural organisations, senior citizen groups or related organisations. Activities performed by individuals who, of their own accord, engage in the sustained, non-obligated helping of strangers will, however, also be included. We are aware that it may be difficult to distinguish such activities from informal ‘helping out’. New forms of organising volunteer work are emerging. An example of such an activity to be included (i.e., one that is more than the informal helping out between friends and family members) is the on-going volunteer work done under the auspices of ‘Venligboerne’ in Denmark. Venligboerne is an initiative that is managed by the civil society, and people are linked together by a common identity and a common goal of creating an inclusive community for refugees. People can arrange, organise and volunteer in local joint initiatives such as establishing a café at the asylum centre to arrange large celebrations of festive seasons without being framed by an organisation with a given structure (for more information see Kelstrup, 2016). If relevant studies of this kind of voluntary activity are identified, they will be analysed separately.
Relevant studies will be identified through searches in electronic databases, governmental and grey literature repositories, hand searches of specific targeted journals, citation tracking, contact with international experts, and internet search engines.
The following international databases will be searched:
The following grey literature resources will be searched:
UK Institute for Volunteering Research - https://www.ncvo.org.uk/institute-for-volunteering-research
Danish Institute for Voluntary Effort - https://frivillighed.dk/
Danish National Research Database - http://www.forskningsdatabasen.dk/en
Volunteer Bénévoles/Canada - https://volunteer.ca/research-resources
Corporation for National and Community Service - https://www.nationalservice.gov/
Royal Voluntary Service - https://www.royalvoluntaryservice.org.uk/
NBER Working Papers - http://www.nber.org/papers.html
Open Grey - http://opengrey.eu/
Google Scholar (specific for grey literature) - https://scholar.google.com
Google searches - https://scholar.google.com/
Further sources of grey literature might be added throughout the search process.
Five specific journals will be hand-searched:
The Journals of Gerontology
American Journal of Public Health
Gerontologist
International Journal of Geriatric Psychiatry
Journal of Applied Gerontology
In order to identify both published studies and grey literature we will utilize citation-tracking/snowballing strategies. Our primary strategy will be to citation-track related systematic reviews and meta-analyses. The review team will also check reference lists of included primary studies for new leads. We will contact international experts to identify unpublished and ongoing studies.
Below is an example of a search string used to search PsycINFO. The search string will be modified to fit each database listed.
Terms used for the grey literature search will be based on the general search strategy. Combinations of terms such as “volunteer*” with terms for the population (i.e. old or elderly people) or the outcome terms (i.e. health outcomes) will be utilised.
We do not expect to find any randomised controlled trials.
Studies of the effect of voluntary work are required to have a control group for inclusion in the review. An example of a study that may be included is Musick and Wilson (2003), who compared a group of volunteers with a group of non-volunteers of the same age. They controlled for a range of sociodemographic variables and in addition included two health variables as control factors. The first health measure assessed functional impairment, and the second was a sum of life-threatening conditions such as heart attack, stroke, lung disease, diabetes, and cancer.
We will take into account the unit of analysis of the studies to determine whether individuals were randomised in groups (i.e. cluster-randomised trials), whether individuals may have undergone multiple interventions, whether there were multiple treatment groups and whether several studies are based on the same data source. Cluster-randomised trials included in this review will be checked for consistency in the unit of allocation and the unit of analysis, as statistical analysis errors can occur when they are different. When appropriate analytic methods have been used, we will meta-analyse effect estimates and their standard errors (Higgins & Green, 2011). In cases where study investigators have not applied appropriate analysis methods that control for clustering effects, we will estimate the intra-cluster correlation (Donner, Piaggio, & Villar, 2001; Hedges, 2007b) and correct standard errors.
Studies with multiple intervention groups with different individuals will be included in this review, although only intervention and control groups that meet the eligibility criteria will be used in the data synthesis. To avoid problems with dependence between effect sizes we will apply robust standard errors (Hedges, Tipton, & Johnson, 2010) and use the small sample adjustment to the estimator itself (Tipton, 2015).
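The correction of standard errors for clustering mentioned above can be sketched with the widely used design-effect approximation, in which a standard error from an analysis that ignored clustering is inflated by the square root of 1 + (m - 1) × ICC, where m is the average cluster size and ICC is the intra-cluster correlation. This is a minimal sketch under that assumption; it is not necessarily the exact procedure of Donner, Piaggio and Villar (2001) or Hedges (2007b), and the function names and input values are hypothetical.

```python
import math


def design_effect(mean_cluster_size: float, icc: float) -> float:
    """Design effect 1 + (m - 1) * ICC for clusters of average size m."""
    if mean_cluster_size < 1:
        raise ValueError("mean cluster size must be at least 1")
    if not 0.0 <= icc <= 1.0:
        raise ValueError("ICC must lie between 0 and 1")
    return 1.0 + (mean_cluster_size - 1.0) * icc


def cluster_adjusted_se(se_naive: float, mean_cluster_size: float, icc: float) -> float:
    """Inflate a standard error that ignored clustering by sqrt(design effect)."""
    return se_naive * math.sqrt(design_effect(mean_cluster_size, icc))


# Hypothetical values: naive SE of 0.10, 21 participants per cluster, ICC of 0.05.
print(design_effect(21, 0.05))
print(round(cluster_adjusted_se(0.10, 21, 0.05), 4))
```

With an ICC of zero the design effect equals 1 and the standard error is unchanged, which is a quick sanity check on any implementation of this adjustment.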
We will use the results in Tanner-Smith & Tipton (2014) and Tipton (2015) to evaluate if there are enough studies for this method to consistently estimate the standard errors. See Section Data Synthesis below for more details about the data synthesis. If there are not enough studies, we will use a synthetic effect size (the average) in order to avoid dependence between effect sizes. This method provides an unbiased estimate of the mean effect size parameter but overestimates the standard error. Random effects models applied when synthetic effect sizes are involved actually perform better in terms of standard errors than do fixed effects models (Hedges, 2007a). However, tests of heterogeneity when synthetic effect sizes are included are rejected less often than nominal. If pooling is not appropriate (e.g., the multiple interventions and/or control groups include the same individuals), only one intervention group will be coded and compared to the control group to avoid overlapping samples. The choice of which estimate to include will be based on our risk of bias assessment. We will choose the estimate that we judge to have the least risk of bias (primarily, selection bias and in case of equal scoring the incomplete data item will be used). In some cases, several studies may have used the same sample of data or some studies may have used only a subset of a sample used in another study. We will review all such studies, but in the meta-analysis we will only include one estimate of the effect from each sample of data. This will be done to avoid dependencies between the “observations” (i.e. the estimates of the effect) in the meta-analysis. The choice of which estimate to include will be based on our risk of bias assessment of the studies. We will choose the estimate from the study that we judge to have the least risk of bias (primarily, selection bias). 
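The synthetic effect size described above (the average of the dependent effect sizes from one study) can be sketched as follows. The source does not state how the variance of the average is computed; this sketch assumes the conservative choice of the mean of the individual variances, which cannot be smaller than the true variance of the average and is therefore consistent with the remark that the standard error is overestimated. The function name and the numbers are hypothetical.

```python
import math
from statistics import mean


def synthetic_effect_size(effects, variances):
    """Collapse dependent effect sizes from one study into a single estimate.

    Returns (average effect, conservative variance). The variance is the
    mean of the individual variances, an assumed upper bound that is exact
    only when the effect sizes are perfectly correlated with equal variances.
    """
    if not effects or len(effects) != len(variances):
        raise ValueError("effects and variances must be non-empty and equal in length")
    return mean(effects), mean(variances)


# Hypothetical study reporting three dependent effect sizes.
effect, variance = synthetic_effect_size([0.20, 0.30, 0.40], [0.04, 0.05, 0.06])
print(round(effect, 3), round(math.sqrt(variance), 3))
```

Each study then contributes a single effect size to the meta-analysis, which removes the dependence at the cost of a somewhat wider confidence interval.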
If two (or more) studies are judged to have the same risk of bias and one (or more) of the studies uses a subset of a sample used in another study (or studies), we will include the study using the full set of participants.
When the results are measured at multiple time points, each outcome at each time point will be analysed in a separate meta-analysis with other comparable studies taking measurements at a similar time point. As a general guideline, these will be grouped together as follows:
1) While actively engaged in voluntary work
2) At cessation of volunteering to one year after cessation of volunteering
3) More than one year after cessation of volunteering
However, should the studies provide viable reasons for an adjusted choice of relevant and meaningful duration intervals for the analysis of outcomes, we will adjust the grouping.
Selection of studies and data extraction
Under the supervision of review authors, two review team assistants will first independently screen titles and abstracts to exclude studies that are clearly irrelevant. Studies considered eligible by at least one assistant, or studies where there is insufficient information in the title and abstract to judge eligibility, will be retrieved in full text. The full texts will then be screened independently by two review team assistants under the supervision of the review authors. Any disagreement about eligibility will be resolved by the review authors. Exclusion reasons for studies that otherwise might be expected to be eligible will be documented and presented in an appendix. The study inclusion criteria will be piloted by the review authors (see Appendix 1.1). The overall search and screening process will be illustrated in a flow diagram. None of the review authors will be blind to the authors, institutions, or the journals responsible for the publication of the articles.
Two review authors will independently code and extract data from included studies.
A coding sheet will be piloted on several studies and revised as necessary (see Appendix 1.2 and 1.3). Disagreements will be resolved by consulting a third review author with extensive content and methods expertise. Disagreements resolved by a third reviewer will be reported. Data and information will be extracted on: available characteristics of participants, intervention characteristics and control conditions, research design, sample size, risk of bias and potential confounding factors, outcomes, and results. Extracted data will be stored electronically. Analysis will be conducted using RevMan5 and Stata software.
We will assess the risk of bias using a model developed by Prof. Barnaby Reeves in association with the Cochrane Non-Randomised Studies Methods Group (Reeves, Deeks, Higgins, & Wells, 2011). This model is an extension of the Cochrane Collaboration's risk of bias tool and covers risk of bias in non-randomised studies that have a well-defined control group. The extended model is organised in the same way, and follows the same steps, as the risk of bias model in the 2008 version of the Cochrane Handbook, chapter 8 (Higgins & Green, 2008). The extension to the model is explained in the three following points:
The refined assessment is pertinent when considering data synthesis as it operationalizes the identification of those studies with a very high risk of bias (especially in relation to non-randomised studies). The refinement increases transparency in assessment judgements and provides justification for excluding a study with a very high risk of bias from the data synthesis.
The risk of bias model used in this review is based on nine items (see Appendix 1.3). The nine items refer to:
In the 5-point scale, 1 corresponds to Low risk of bias and 5 corresponds to High risk of bias.
A score of 5 on any of the items assessed on the 5-point scale translates to a risk of bias so high that the findings will not be considered in the data synthesis (because they are more likely to mislead than inform). An important part of the risk of bias assessment of non-randomised studies is consideration of how the studies deal with confounding factors (see Appendix 1.3). Selection bias is understood as systematic baseline differences between groups which can therefore compromise comparability between groups. Baseline differences can be observable (e.g. age and gender) and unobservable (to the researcher; e.g. motivation and ‘ability’). There is no single non-randomised study design that always solves the selection problem. Different designs represent different approaches to dealing with selection problems under different assumptions, and consequently require different types of data. There can be particularly great variations in how different designs deal with selection on unobservables. The “adequate” method depends on the model generating participation, i.e. assumptions about the nature of the process by which participants are selected into a programme. A major difficulty in estimating causal effects of voluntary work is the potential endogeneity of the individual's health condition th